Abstract

Quantitative criteria are proposed to identify genes (and sets of genes) whose expression marks a specific
brain region (or a set of brain regions). Gene-expression energies, obtained for thousands of mouse 
genes by numerization of in situ hybridization images in the Allen Gene Expression Atlas, are used to 
test these methods in the mouse brain. Individual genes are ranked using integrals of their expression 
energies across brain regions. The ranking is generalized to sets of genes and the problem of optimal 
markers of a classical region receives a linear-algebraic solution. Moreover, the goodness of the fitting 
of the expression profile of a gene to the profile of a brain region is closely related to the co-expression 
of genes. The geometric interpretation of this fact leads to a quantitative criterion to detect markers of 
pairs of brain regions. Local properties of the gene-expression profiles are also used to detect genes that
separate a given brain region from its environment.

1 Introduction

Neuroanatomy is experiencing a renaissance under the influence of molecular biology and computational 
methods. Brain regions can be delineated on stained sections of brain tissue. The set of boundaries 
between brain regions defined on sections can be registered in order to obtain a three-dimensional atlas.
Conflicts exist between the various nomenclatures of brain regions. The present paper will consider
brain regions defined by classical anatomy as in the Allen Reference Atlas [1]. Gene-expression energies
are positive quantities defined at every point in the brain (or rather at every cubic voxel of side
equal to the resolution, which is 200 microns in the present paper). With contemporary techniques of in
situ hybridization, such data were produced by the Allen Institute for thousands of genes in the mouse
brain [2][3]. This makes the ISH data much higher-dimensional than classical neuroanatomy. Given an
anatomical atlas, it is therefore natural to ask if the patterns formed by the expression energy of single 
genes and/or sets of genes can delineate and/or separate brain regions. 



The structure of the paper is as follows. We will first formalize the notion of marker genes by defining
quantitative criteria that allow us to rank individual genes by computing scores. The localization score
measures how much of the expression energy of a gene is contained in the region of interest. The fitting
score measures how close the expression-energy profile is to the characteristic function of the region.
The associated rankings of genes are computed. The genes ranked as the top few markers make sense
optically, but there are conflicts between the two rankings. The two criteria are then used to rank sets of
genes as markers of brain regions. The localization score gives rise to a generalized eigenvalue problem, 
and the solutions can have much higher localization scores than individual genes, but they are difficult to 
interpret because the sets of genes are very large and weighted by coefficients of alternating signs. The 
fitting score of sets of genes gives rise to sparse sets of markers. These two scores are easy to compute 
and to generalize, but they are both global in nature and they penalize genes that fit well the centermost 
part of a brain region, have low expression around the region, and high expression in remote parts of the 
brain. But such genes are interesting to detect, as they separate brain regions from their environment. A 
local fitting criterion is proposed, using the eikonal function in order to formalize the situation described 
above. Markers of pairs of regions are also investigated. They may be of special interest to evolution, 
especially for pairs of regions that are not equally well-identified in other species. 



2 Methods and models 

2.1 Gene expression energies and classical neuroanatomy 

The gene expression energies we analyzed were drawn from the Allen Gene Expression Atlas [2]. The
steps taken in an automated pipeline to produce those data for each gene can be summarized as follows:
1. Colorimetric in situ hybridization;
2. Automatic processing of the resulting images: find the tissue area, eliminating artifacts, and look for
cell-shaped objects of size ~10-30 microns to minimize artifacts;
3. Aggregation of the raw pixel data into a grid.

The mouse brain is partitioned into cubic voxels of side 200 microns (the whole brain consists of
~50,000 voxels). For every voxel v, the expression energy of the gene g is defined as a weighted sum of
the greyscale-value intensities I evaluated at the pixels p intersecting the voxel:

\[ E(v,g) = \frac{\sum_{p \in v} M(p)\, I(p)}{\sum_{p \in v} 1}, \]

where M(p) is a Boolean mask worked out at step 2, with value 1 if the pixel is expressing and 0 if it is
non-expressing.
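
The aggregation of pixels into voxels can be sketched in a few lines. This is a minimal illustration and not the Allen pipeline: the function name `expression_energy`, the flat pixel arrays and the `voxel_of_pixel` index are hypothetical, and dividing the masked intensities by the number of pixels in each voxel is an assumption of the sketch.

```python
import numpy as np

def expression_energy(intensity, mask, voxel_of_pixel, n_voxels):
    """Aggregate pixel data into one expression energy per voxel.

    intensity      : greyscale intensity I(p) of each pixel
    mask           : Boolean mask M(p), 1 for expressing pixels, 0 otherwise
    voxel_of_pixel : index of the voxel containing each pixel (hypothetical layout)
    """
    intensity = np.asarray(intensity, dtype=float)
    mask = np.asarray(mask, dtype=float)
    voxel_of_pixel = np.asarray(voxel_of_pixel, dtype=int)
    # Sum of the masked intensities falling in each voxel.
    masked_sum = np.bincount(voxel_of_pixel, weights=mask * intensity, minlength=n_voxels)
    # Number of pixels in each voxel (assumed normalization).
    pixel_count = np.bincount(voxel_of_pixel, minlength=n_voxels)
    energy = np.zeros(n_voxels)
    filled = pixel_count > 0
    energy[filled] = masked_sum[filled] / pixel_count[filled]
    return energy
```

Voxels that contain no pixel are given a zero energy here, which is a convention of the sketch.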

Partitions of the brain (or of the left hemisphere) of various degrees of coarseness in terms of classically-
defined neuroanatomical regions were also published in the Allen Reference Atlas ([1], see also the white
paper http://mouse.brain-map.org/documentation/index.html for the definition of expression
energies).

The present analysis is focused on a subset of the genes for which sagittal and coronal data are available
from the Allen data. We computed the correlation coefficients between sagittal and coronal data
and selected the genes in the top-three quartiles of correlation (this makes for 3041 genes) for further 
analysis. Of course the quantitative methods can be tested against larger datasets or different reference 
atlases, but the genes we selected are already numerous enough to motivate the use of computational 
methods to detect markers. 
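
The selection step can be illustrated as follows. This is a sketch under assumptions: the text does not specify the type of correlation coefficient, so Pearson correlation over voxels is assumed, and the function name `select_consistent_genes` and the (voxels x genes) array layout are hypothetical.

```python
import numpy as np

def select_consistent_genes(E_sagittal, E_coronal, quantile=0.25):
    """Keep the genes whose sagittal/coronal correlation is above the lowest quartile.

    Both inputs are (voxels x genes) arrays defined on the same voxel grid.
    """
    sag = np.asarray(E_sagittal, dtype=float)
    cor = np.asarray(E_coronal, dtype=float)
    sag = sag - sag.mean(axis=0)
    cor = cor - cor.mean(axis=0)
    denom = np.linalg.norm(sag, axis=0) * np.linalg.norm(cor, axis=0)
    # Pearson correlation per gene; genes with a constant profile get correlation 0.
    corr = np.divide((sag * cor).sum(axis=0), denom,
                     out=np.zeros_like(denom), where=denom > 0)
    threshold = np.quantile(corr, quantile)
    # Genes in the top three quartiles of correlation.
    return np.flatnonzero(corr >= threshold), corr
```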



2.2 Individual genes: localization scores 

Given a brain region of interest, say the cerebral cortex (call it $\omega$), define the localization score of a gene
g as the fraction of the squared $L^2$ norm of its expression energy contained in the region:

\[ \lambda_\omega(g) = \frac{\int_\omega E(v,g)^2\,dv}{\int_\Omega E(v,g)^2\,dv}, \]

where $\Omega$ is the whole brain.
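
On a voxel grid the integrals become sums over voxels, so the score can be computed for all genes at once. The sketch below assumes that the expression energies are stored as a (voxels x genes) array and the region as a Boolean mask; the name `localization_scores` is hypothetical.

```python
import numpy as np

def localization_scores(E, region_mask):
    """Fraction of the squared L2 norm of each gene's energy lying in the region.

    E           : (voxels x genes) array of expression energies
    region_mask : Boolean array over voxels, True inside the region
    """
    squared = np.asarray(E, dtype=float) ** 2
    region_mask = np.asarray(region_mask, dtype=bool)
    total = squared.sum(axis=0)
    inside = squared[region_mask].sum(axis=0)
    # Genes with no signal at all are given a score of 0.
    return np.divide(inside, total, out=np.zeros_like(total), where=total > 0)
```

Sorting the genes by decreasing score, for instance with `np.argsort(-scores)`, gives the ranking of genes as markers of the region.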

We computed the localization score of every gene in every region, for a given annotation of the brain. 
These numbers induce a ranking of genes as markers of each region of the brain, the better markers 
having higher localization scores. A perfect marker of $\omega$ according to this criterion would have a score
of 1. When going from one region to another, one has to be careful when comparing the values of the
localization scores: as the volumes of the brain regions vary across the annotation, the localization score
is biased by the size of the region. We need a reference in order to estimate how good a localization
score is compared to what could be expected for a given brain region. We can use two references:
• Uniform reference. Consider an indifferent marker that would be expressed uniformly across the
brain. Its localization score in region $\omega$ is simply the relative volume of the region:

\[ \lambda_\omega^{\mathrm{unif}} = \frac{\mathrm{Vol}(\omega)}{\mathrm{Vol}(\Omega)}. \]

A gene is a better marker of $\omega$ than expected from a uniform expression if its score $\lambda_\omega(g)$ is larger than
$\lambda_\omega^{\mathrm{unif}}$.
• Average (data-driven) reference. A more realistic reference is given by the gene-expression profile
averaged across all the genes:

\[ E^{\mathrm{average}}(v) = \frac{1}{G}\sum_{g=1}^{G} E(v,g), \]

where G is the number of genes in the dataset. The corresponding localization score in a given region $\omega$ is:

\[ \lambda_\omega^{\mathrm{average}} = \frac{\int_\omega E^{\mathrm{average}}(v)^2\,dv}{\int_\Omega E^{\mathrm{average}}(v)^2\,dv}. \]

A gene is a better marker of $\omega$ than expected from an average expression if its score $\lambda_\omega(g)$ is larger than
$\lambda_\omega^{\mathrm{average}}$.

The values of these references, and the rankings of genes for $\omega$ taken from the list of the 12 largest regions
in the left hemisphere, are presented in the results section and in appendices.
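
Both references can be computed directly from the voxelized data. The sketch below assumes voxels of equal volume (so that the relative volume of a region is the fraction of voxels it contains) and a (voxels x genes) array layout; the function names are hypothetical.

```python
import numpy as np

def localization_score(profile, region_mask):
    """Localization score of a single expression profile (one value per voxel)."""
    squared = np.asarray(profile, dtype=float) ** 2
    return squared[region_mask].sum() / squared.sum()

def reference_scores(E, region_mask):
    """Uniform and average references for one region."""
    region_mask = np.asarray(region_mask, dtype=bool)
    # Uniform reference: relative volume of the region (equal-volume voxels assumed).
    uniform = region_mask.mean()
    # Average reference: localization score of the profile averaged over all genes.
    average_profile = np.asarray(E, dtype=float).mean(axis=1)
    average = localization_score(average_profile, region_mask)
    return uniform, average
```

A gene whose localization score exceeds the chosen reference is a better marker of the region than expected under that reference.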

2.3 Individual genes: fitting scores 

The criterion defined above does not take into account the distribution of the signal inside the region of
interest: the localization score for a given gene in region $\omega$ is invariant under a transformation that moves
the whole expression energy into a single voxel within $\omega$, leaving all other voxels in $\omega$ with a zero signal.
It is therefore desirable to have another ranking of genes as markers, one that compares the gene-expression
profiles to characteristic functions of brain regions.

This criterion compares the shape of the expression energy profile of a gene and the shape of the
region of interest. The fitting score $\phi_\omega(g)$ of gene g in region $\omega$ is defined as follows:

\[ \phi_\omega(g) = 1 - \frac{1}{2}\int_\Omega \left(E_{\mathrm{norm}}(v,g) - \chi_\omega(v)\right)^2 dv = \int_\Omega E_{\mathrm{norm}}(v,g)\,\chi_\omega(v)\,dv, \]

where $\chi_\omega$ is the characteristic function of $\omega$ normalized in the $L^2$ sense, and $E_{\mathrm{norm}}$ is a normalized version
of the expression energy (the columns of the matrix $E_{\mathrm{norm}}$ are the columns of the matrix E, normalized
in the $L^2$ sense):

\[ E_{\mathrm{norm}}(v,g) = \frac{E(v,g)}{\sqrt{\int_\Omega E(v,g)^2\,dv}}. \]

A perfect marker of the region $\omega$ would be a gene with fitting score equal to 1. The geometric interpretation
of this coefficient is as the cosine of the angle between the unit vectors $E_{\mathrm{norm}}(\cdot,g)$ and $\chi_\omega$ in voxel space.
A perfect marker of region $\omega$ is a gene whose expression profile is collinear with the characteristic function
of region $\omega$. Again, this score can be evaluated for all the genes in the dataset, and induces a
ranking of genes (see the results section and appendices).
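
The fitting score is a cosine between two vectors in voxel space, which makes it a one-line matrix product once both vectors are normalized. The sketch below assumes a (voxels x genes) array and a Boolean region mask; the name `fitting_scores` is hypothetical.

```python
import numpy as np

def fitting_scores(E, region_mask):
    """Cosine of the angle between each gene profile and the region indicator."""
    E = np.asarray(E, dtype=float)
    region_mask = np.asarray(region_mask, dtype=bool)
    # Characteristic function of the region, with unit L2 norm.
    chi = region_mask / np.sqrt(region_mask.sum())
    norms = np.linalg.norm(E, axis=0)
    overlap = chi @ E
    # Dividing the overlap by the norm is the same as normalizing E first.
    return np.divide(overlap, norms, out=np.zeros_like(norms), where=norms > 0)
```

A gene whose profile is proportional to the indicator of the region gets a score of 1, whatever its absolute intensity.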

2.4 Sets of genes: optimal localization scores as a generalized eigenvalue 
problem 

Consider the problem of optimizing the localization score of a set of genes, whose collective expression 
energy is taken to be a linear combination of the expression energies in our dataset: 

\[ E_\alpha(v) = \sum_{g=1}^{G} \alpha_g E(v,g), \]

where G = 3041 is the number of genes in our dataset. The previous analysis corresponded to vectors $\alpha$
with only one non-zero coordinate.

The localization score in the brain region $\omega$ of a set of genes is naturally written as

\[ \lambda_\omega(\alpha) = \frac{\int_\omega E_\alpha(v)^2\,dv}{\int_\Omega E_\alpha(v)^2\,dv} = \frac{\alpha^t J^\omega \alpha}{\alpha^t J^\Omega \alpha}, \]

where the quadratic forms $J^\omega$ and $J^\Omega$ have coefficients given by scalar products of gene expression profiles
across $\omega$ and the whole brain:

\[ J^\omega_{gh} = \int_\omega E(v,g)E(v,h)\,dv, \qquad J^\Omega_{gh} = \int_\Omega E(v,g)E(v,h)\,dv. \]

We can fix the overall dilation invariance by fixing the value of the quadratic form in the denominator,
so that maximizing the localization score boils down to maximizing one quadratic form under a
quadratic constraint:

\[ \max_{\alpha \in \mathbb{R}^G} \lambda_\omega(\alpha) = \max_{\alpha \in \mathbb{R}^G,\ \alpha^t J^\Omega \alpha = 1} \alpha^t J^\omega \alpha. \]

Introducing the Lagrange multiplier $\sigma$ associated to the constraint, we are led to maximizing the following
quantity with respect to $\alpha$:

\[ L_\omega(\alpha) = \alpha^t J^\omega \alpha - \sigma\left(\alpha^t J^\Omega \alpha - 1\right). \]

The stationarity condition reads as a generalized eigenvalue problem,

\[ J^\omega \alpha = \sigma J^\Omega \alpha, \]

and the Lagrange multiplier is the largest generalized eigenvalue. Maximizing the generalized localization
score is therefore equivalent to finding the largest generalized eigenvalue corresponding to the quadratic
forms $J^\omega$ and $J^\Omega$.

Of course the alternating signs of the coefficients make these sets difficult to interpret. But these
algebraic sums provide absolute optima that one could not beat by taking combinations of genes with
positive coefficients. The negative coefficients allow one to offset the contribution of some genes outside the
region of interest.
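
The generalized eigenvalue problem can be solved numerically with standard linear algebra. The sketch below is one possible approach, not necessarily the one used for the results: it assumes that the quadratic form over the whole brain is positive definite (linearly independent gene profiles, more voxels than genes) and reduces the problem to an ordinary symmetric eigenproblem through a Cholesky factorization. The name `optimal_localization` is hypothetical.

```python
import numpy as np

def optimal_localization(E, region_mask):
    """Largest generalized eigenvalue and eigenvector of (J_region, J_brain)."""
    E = np.asarray(E, dtype=float)
    region_mask = np.asarray(region_mask, dtype=bool)
    J_region = E[region_mask].T @ E[region_mask]   # scalar products over the region
    J_brain = E.T @ E                              # scalar products over the whole brain
    # J_brain = L L^t; in the variable u = L^t alpha the constraint becomes |u| = 1.
    L = np.linalg.cholesky(J_brain)
    whitened = np.linalg.solve(L, np.linalg.solve(L, J_region).T)  # L^-1 J_region L^-t
    eigenvalues, eigenvectors = np.linalg.eigh(whitened)           # ascending order
    alpha = np.linalg.solve(L.T, eigenvectors[:, -1])
    return eigenvalues[-1], alpha
```

The returned coefficients satisfy the constraint that the quadratic form over the whole brain equals 1, and the returned eigenvalue is the localization score of the weighted set of genes.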



2.5 Sets of genes: optimized fitting scores for sparse sets of genes 

As the optimal set of markers is very hard to interpret due to the alternating signs of its components, we
can take advantage of the simple quadratic structure of the error function used to compute fitting scores
in order to obtain sets of markers with positive coefficients. Optimization of a quadratic form under
positivity constraints is all we need to compute the optimal sets of markers. Let us write down the fitting
error function for a set of genes and expand it in powers of the coefficients:

\[ \mathrm{ErrFit}_\omega(c) = \int_\Omega \Big(\sum_{g=1}^{G} c_g E(v,g) - \chi_\omega(v)\Big)^2 dv
= \sum_{g,h} c_g c_h J_{gh} - 2\sum_{g} c_g f_g + 1, \qquad (2) \]

where $\omega$ and $\Omega$ respectively denote the brain region of interest and the whole brain. The problem of finding
the best-fitting set of genes therefore boils down to the following quadratic programming problem under
positivity constraints:

\[ c^{\mathrm{opt}} = \mathrm{argmin}_{c \in \mathbb{R}_+^G}\, \mathrm{ErrFit}_\omega(c) = \mathrm{argmin}_{c \in \mathbb{R}_+^G} \Big(\frac{1}{2}\, c^t J c - f^t c\Big), \]

with the following notations for the quadratic form J and the vector f:

\[ J_{gh} = \int_\Omega E(v,g)E(v,h)\,dv, \qquad f_g = \int_\Omega E(v,g)\chi_\omega(v)\,dv. \]

The set of genes with strictly positive coefficients corresponds to the set of inactive constraints. It happens
to be much sparser than the vector encoding the generalized eigenvector for the cortex localization
problem (see figure 11).

However, lots of secondary minima are guaranteed to exist when larger and larger sets of genes are 
taken into account, and coefficients c of very different norms can be hard to use to construct markers out 
of digitized data, as the absolute intensity of genes is quite heterogeneous, and a gene with low absolute 
intensity can happen to be weighted by a large coefficient, thus amplifying noise rather than contributing 
to a realistic marker. 
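
The quadratic programming problem under positivity constraints is a non-negative least-squares problem, since the fitting error is the squared distance between the weighted sum of profiles and the normalized characteristic function. The sketch below uses `scipy.optimize.nnls` as an off-the-shelf solver, which is a choice of this illustration and not necessarily the solver used for the results; the name `sparse_marker_set` and the tolerance on the coefficients are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def sparse_marker_set(E, region_mask, tolerance=1e-12):
    """Non-negative coefficients minimizing the fitting error of a set of genes."""
    E = np.asarray(E, dtype=float)
    region_mask = np.asarray(region_mask, dtype=bool)
    # Characteristic function of the region, with unit L2 norm.
    chi = region_mask / np.sqrt(region_mask.sum())
    # nnls minimizes |E c - chi| subject to c >= 0 and returns (c, residual norm).
    coefficients, residual_norm = nnls(E, chi)
    selected = np.flatnonzero(coefficients > tolerance)  # inactive constraints
    return coefficients, selected, residual_norm ** 2
```

The third returned value is the fitting error of the optimal set, and `selected` lists the genes that enter the set with a strictly positive coefficient.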

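But we can take advantage of the expression of the fitting score in terms of the scalar product between
the gene expression profile and the characteristic function of the brain region. For coefficients such that
the combination $\sum_g c_g E_g$ has unit $L^2$ norm,

\[ \mathrm{ErrFit}_\omega(c) = 2 - 2\int_\Omega \sum_{g} c_g E_g(v)\,\chi_\omega(v)\,dv, \]

\[ c^{\mathrm{opt}} = \mathrm{argmax}_{c \in \mathbb{R}_+^G} \int_\Omega \sum_{g} c_g E_g(v)\,\chi_\omega(v)\,dv. \]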



Another approach to the optimization problem consists in looking for sets of genes such that the co- 
expression between the sum of the expression energies of those genes and the characteristic function is 
larger than that of any individual gene. This can happen, for instance, if the characteristic function in
voxel space equals the sum of two genes, whose expression energies are two independent vectors in voxel 
space: the cosine of the angle between any of these two vectors with the characteristic function is strictly 
smaller than one, but the cosine of the angle between the sum and the characteristic function equals one. 



This is a finite problem, even though the number of subsets is extremely large. We impose a maximum 
Gmax on the number of genes we want to accept, and adopt a bootstrapping approach: we repeatedly 
draw random subsets of size Gmax from our set of genes, and keep the subsets that beat the record fitting 
score (this record is initialized at the highest fitting score for an individual gene). 
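
The random search over subsets can be sketched as follows. The names `random_subset_search`, `max_genes` and `n_draws` are hypothetical, as is the use of a fixed random seed; the score of a subset is taken to be the cosine between the sum of its expression profiles and the characteristic function, as described above.

```python
import numpy as np

def random_subset_search(E, region_mask, max_genes, n_draws, seed=0):
    """Draw random subsets of genes and keep the one with the best fitting score."""
    E = np.asarray(E, dtype=float)
    region_mask = np.asarray(region_mask, dtype=bool)
    chi = region_mask / np.sqrt(region_mask.sum())
    rng = np.random.default_rng(seed)

    def fitting_score(genes):
        # Cosine between the summed profile of the subset and the region indicator.
        profile = E[:, genes].sum(axis=1)
        norm = np.linalg.norm(profile)
        return float(chi @ profile / norm) if norm > 0 else 0.0

    n_genes = E.shape[1]
    # The record is initialized at the best fitting score of an individual gene.
    single = [fitting_score([g]) for g in range(n_genes)]
    best_genes = [int(np.argmax(single))]
    best_score = max(single)
    for _ in range(n_draws):
        genes = rng.choice(n_genes, size=max_genes, replace=False)
        score = fitting_score(genes)
        if score > best_score:
            best_genes, best_score = sorted(int(g) for g in genes), score
    return best_genes, best_score
```

The search is stochastic, so the returned subset is only the best one encountered among the draws, not a guaranteed optimum.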



2.6 Separation properties 

The methods described so far are global in nature in the sense that the error functions involve sums of 
expression energies over the whole brain. This corresponds to evaluating how a brain region is singled out 
with respect to the rest of the brain. No attention is paid to the position of the voxels that contribute to 
the error functions: a voxel with high expression in the cerebellum will penalize a gene as a marker of the
striatum, no more and no less than if it was in the ventral pallidum. However it may be interesting
to detect genes that separate some brain region from its environment, without necessarily highlighting
this brain region in an exclusive way. The description of such a situation implies a more local error
function.

Indeed, when looking at the expression profile of a gene in the neighbourhood of a particular brain
region, one can sometimes notice that the region is well-separated from the rest of the brain, because the
expression is high in voxels close to the center of the region, and locally declines around the center. At 
large distances from the center, the details of the gene-expression profile matter much less, as long as this 
pattern of decreasing expression from center to boundary is detected. Such genes have good separation 
properties. 

The separation property we described above corresponds to the situation where the gene-expression
pattern looks like a plateau around the center of the region $\omega$, and gradually fades away when the
boundary of the region is crossed. Of course the notion of center of a brain region needs to be defined
more precisely. So does the notion of distance to a brain region. The eikonal distance to the boundary
of the region is a geometric quantity that is well adapted to this problem, as it measures the minimal
distance traveled by light emitted from the boundary of the region [5]. In order to control how far from
the center of a region a voxel is, one can therefore solve the eikonal equation with boundary conditions
on the boundary of the region:

\[ |\nabla h_\omega| = 1, \qquad h_\omega\big|_{\partial\omega} = 0. \]

The eikonal distance has been used to place injections in the brain in a way that preserves the boundaries
of regions defined by the Allen Atlas [6][7][9]. It is also a useful tool to evaluate the misalignment of
skulls and skull variability in stereotactic protocols |Hj . The equation is solved using level-set methods |3] . 

We define a model function that detects the most central part of the region $\omega$, using the eikonal
function as a measure of centrality. The function is positive: it is a plateau in the central part of $\omega$, and
fades away across voxels that are more peripheral to $\omega$. More specifically, let us define the eikonal radius
$\rho_\omega$ of the region $\omega$ as the maximum value of the eikonal function inside the region:

\[ \rho_\omega := \max_{v \in \omega} h_\omega(v). \]

Let us first apply a mask to the eikonal function, with negative signs outside the region and positive signs
inside:

\[ h_\omega^{\mathrm{signed}}(v) := h_\omega(v) \times \left(\mathbf{1}(v \in \omega) - \mathbf{1}(v \notin \omega)\right). \]

Our model function equals one around the center of the region, where $h_\omega^{\mathrm{signed}}$ is positive and larger
than a specified fraction t of the eikonal radius. It equals
zero where $h_\omega^{\mathrm{signed}}$ is negative and larger in absolute value than another specified fraction of the eikonal
radius. The values are interpolated between these two regions according to the values of the eikonal function.
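
A rough version of this model function can be built on a voxel grid. The sketch below makes two assumptions that are not in the text: it replaces the level-set solution of the eikonal equation by the Euclidean distance transform of SciPy (which approximates the distance to the boundary for a unit-speed eikonal equation on a grid), and it interpolates linearly between the two thresholds. The names `local_characteristic_function`, `inner_fraction` and `outer_fraction` are hypothetical.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def local_characteristic_function(region_mask, inner_fraction, outer_fraction):
    """Plateau equal to 1 near the center of the region, fading to 0 outside."""
    region_mask = np.asarray(region_mask, dtype=bool)
    # Distance from each inside voxel to the nearest outside voxel, and vice versa.
    inside = distance_transform_edt(region_mask)
    outside = distance_transform_edt(~region_mask)
    signed = inside - outside            # positive inside, negative outside
    radius = inside.max()                # eikonal radius of the region
    upper = inner_fraction * radius      # above this value the function equals 1
    lower = -outer_fraction * radius     # below this value the function equals 0
    # Linear interpolation between the two thresholds (assumed interpolation rule).
    model = np.clip((signed - lower) / (upper - lower), 0.0, 1.0)
    return model, signed, radius
```

The voxels where the returned model function is strictly positive form the support that plays the role of the whole brain in the local versions of the criteria.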

Having defined this local characteristic function $\xi_\omega$ around the brain region of interest, one can treat
the support of $\xi_\omega$ as we treated the whole brain in the previous computations, and adapt the various
quantitative criteria by making the following substitution:

Global $\longleftrightarrow$ Local,

$\Omega$ = whole brain $\longleftrightarrow$ $\Omega = \mathrm{Supp}\,\xi_\omega$.

This substitution expresses the fact that the expression energy outside the support of the local characteristic
function of $\omega$ can be very singular or very intense without affecting the separation properties.

2.7 Co-markers for pairs of brain regions 

The various quantitative criteria can be repeated for unions of brain regions. For instance one can look
for a marker of the two brain regions $\omega_A$ and $\omega_B$. Ideally one would like the expression profile of a marker
gene to look like the sum of the two characteristic functions of regions A and B, normalized in the
$L^2$ sense (see footnote 1). But it may be interesting to allow the two characteristic functions to be weighted by coefficients,
in order to detect genes whose expression looks like two bumps, one centered around A, one centered
around B, with possibly different intensities.

Consider the two characteristic functions $\chi_A$ and $\chi_B$, normalized in the $L^2$ sense, and a linear
combination thereof with positive coefficients, normalized in the same way. The coefficients of the linear
combination can be interpreted geometrically in terms of a single parameter, which is an angle between
0 and $\pi/2$. Let us denote it by $\theta$:



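\[ \int_\Omega \chi_A(v)^2\,dv = 1, \qquad \mathrm{Supp}\,\chi_A = \omega_A, \]

\[ \int_\Omega \chi_B(v)^2\,dv = 1, \qquad \mathrm{Supp}\,\chi_B = \omega_B, \]

\[ \chi = \alpha\chi_A + \beta\chi_B, \qquad \alpha > 0, \quad \beta > 0, \qquad \int_\Omega \chi(v)^2\,dv = \alpha^2 + \beta^2 = 1, \]

\[ \alpha = \cos\theta, \qquad \beta = \sin\theta, \qquad 0 < \theta < \frac{\pi}{2}, \]

where we have used the fact that the functions $\chi_A$ and $\chi_B$ are orthogonal, because they have disjoint
supports. Geometrically, the function $\chi$ we are trying to fit is a combination of two orthogonal unit vectors
in voxel space, which sits on the intersection of the unit circle and the first quadrant in the two-plane
spanned by these two vectors. We can compute the fitting error for each gene at fixed angle $\theta$, but it can
also be optimized with respect to the angle:

\[ \mathrm{ErrFit}_{A,B}(g,\theta) = \int_\Omega \Big(E(v,g) - \big(\cos\theta\,\chi_A(v) + \sin\theta\,\chi_B(v)\big)\Big)^2 dv \qquad (3) \]

\[ = 2\Big(1 - \cos\theta \int_\Omega E(v,g)\chi_A(v)\,dv - \sin\theta \int_\Omega E(v,g)\chi_B(v)\,dv\Big), \qquad (4) \]

where the expression energy $E(\cdot,g)$ is taken to be normalized in the $L^2$ sense, as in the definition of the fitting score.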

This optimization step corresponds to the fact that a fixed vector in voxel space
(corresponding to a gene) can make a smaller angle with a two-plane than with any of the vectors of an
orthonormal basis of the two-plane. The optimal angle $\theta^*$ is given by the equation:

\[ \frac{\partial\,\mathrm{ErrFit}_{A,B}}{\partial\theta}(g,\theta^*) = 0, \]

i.e.

\[ -\sin\theta^* \int_\Omega E(v,g)\chi_A(v)\,dv + \cos\theta^* \int_\Omega E(v,g)\chi_B(v)\,dv = 0, \]

i.e.

\[ \theta^* = \arctan\left(\frac{\int_\Omega E(v,g)\chi_B(v)\,dv}{\int_\Omega E(v,g)\chi_A(v)\,dv}\right). \]

Footnote 1: One can as well look for genes that separate regions A and B from their respective environments, by considering the
local characteristic functions worked out using the eikonal functions with boundary conditions at the boundaries of A and
B, rather than the characteristic functions.

The value of the error function at the optimal angle (meaning the linear combination of the two
characteristic functions with positive coefficients that is best fit by gene g) is then evaluated in terms of
the scalar products between the gene expression and the two characteristic functions $\chi_A$ and $\chi_B$:

\[ \cos\theta^* = \frac{\int_\Omega E(v,g)\chi_A(v)\,dv}{\sqrt{\left(\int_\Omega E(v,g)\chi_A(v)\,dv\right)^2 + \left(\int_\Omega E(v,g)\chi_B(v)\,dv\right)^2}}, \]

\[ \sin\theta^* = \frac{\int_\Omega E(v,g)\chi_B(v)\,dv}{\sqrt{\left(\int_\Omega E(v,g)\chi_A(v)\,dv\right)^2 + \left(\int_\Omega E(v,g)\chi_B(v)\,dv\right)^2}}, \]

so that

\[ \phi_{A,B}(g,\theta^*) = 1 - \frac{1}{2}\,\mathrm{ErrFit}_{A,B}(g,\theta^*) = \sqrt{\left(\int_\Omega E(v,g)\chi_A(v)\,dv\right)^2 + \left(\int_\Omega E(v,g)\chi_B(v)\,dv\right)^2}. \]

This score is comprised between 0 and 1, as is the fitting score evaluated for the fitting of a single region by
a gene. So, given a non-hierarchical atlas, one can find better fittings for pairs of regions in the atlas
than for single regions.

These scores can be computed for all pairs of regions in a given non-hierarchical atlas. Of course there is
no reason why the top co-marker of regions A and B should be especially more impressive than the best
marker of A or B. From the expressions of the coefficients $\cos\theta^*$ and $\sin\theta^*$, it is clear that in the
case where $\int_\Omega E(v,g)\chi_B(v)\,dv$ is much smaller than $\int_\Omega E(v,g)\chi_A(v)\,dv$, the score of gene g as a co-marker of regions
A and B is slightly larger than its score as a marker of A, but most of the expression will of course be
in A. The value of $\tan\theta^*$ controls the balance between the expression energies in the two regions.
The closer it is to 1, the better the co-marker. Asking for a value of exactly one would amount to
trying to fit the sum of the characteristic functions of regions A and B without weights. Once the genes have been
ranked as co-markers of A and B, one can filter out the genes for which $\tan\theta^*$ is outside a tolerance zone
around 1. This is a balance constraint. The genes at the top of the ranking that do not satisfy it are
rather markers of the region (A or B) that has the highest coefficient. The genes that satisfy it are the
co-markers we are after, and they are penalized by the localization and fitting criteria, both for region A
and for region B:

\[ \text{Balance constraint:} \quad |\tan\theta^* - 1| < t. \]
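
The co-marker score and the balance constraint can be evaluated for all genes at once. The sketch below assumes a (voxels x genes) array of non-negative expression energies and two disjoint Boolean masks; the names `comarker_scores`, `is_balanced` and `tolerance` are hypothetical.

```python
import numpy as np

def comarker_scores(E, mask_a, mask_b):
    """Fitting score of each gene to the best positive combination of two regions."""
    E = np.asarray(E, dtype=float)
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    # Gene profiles and characteristic functions, all with unit L2 norm.
    norms = np.linalg.norm(E, axis=0)
    E_norm = np.divide(E, norms, out=np.zeros_like(E), where=norms > 0)
    chi_a = mask_a / np.sqrt(mask_a.sum())
    chi_b = mask_b / np.sqrt(mask_b.sum())
    overlap_a = chi_a @ E_norm
    overlap_b = chi_b @ E_norm
    theta = np.arctan2(overlap_b, overlap_a)   # optimal angle, in [0, pi/2]
    score = np.hypot(overlap_a, overlap_b)     # square root of the sum of squares
    return score, theta

def is_balanced(theta, tolerance):
    """Balance constraint: tan(theta) within a tolerance zone around 1."""
    return np.abs(np.tan(theta) - 1.0) < tolerance
```

Ranking the genes by decreasing score and keeping those for which `is_balanced` holds separates genuine co-markers from genes that are mostly markers of one of the two regions.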



3 Results and discussion 
3.1 Rankings of genes 

A plot of the sorted localization scores of individual genes is shown on figure 1 for the cerebral cortex, as
well as a table of the best few marker genes. Tables for all the other brain regions in the coarsest Allen
Reference Atlas are included in an appendix. The maximum-intensity projections of the best marker of
the cerebral cortex in the left hemisphere, compared to those of the characteristic function of the
cerebral cortex, are shown on figure 2. The sorted fitting scores and the list of top genes for the cerebral
cortex are shown on figure 3. A coronal section of the ISH data for Satb2, the gene with the highest fitting
score, is shown on figure 5. The cerebral cortex indeed appears strikingly on the section. However, Satb2 is
not the absolute best gene according to the localization criterion, which is Pak7, but it is still among the best
10 genes by localization scores. A coronal section of the ISH data for Pak7 is shown on figure 6. Maximal-intensity
projections of the registered 3D data on figures 2 and 4 indeed show that the expression energy of Satb2 is more evenly

[Figure 1, plot: sorted localization scores for the cerebral cortex; uniform reference in green, average reference in blue.]

Gene              Localization score (cerebral cortex)
Pak7              0.97
Myl4              0.97
Sytl2             0.97
Tnnc1             0.96
Gtdc1             0.96
2310026E23Rik     0.95
Dact2             0.95
Satb2             0.94
Ier5              0.93
D830014K04Rik*    0.93
Baalc             0.92
1810023C24Rik*    0.92
Ddit4l            0.92
LOC433228         0.91
Rorb              0.91
Gnb4              0.90
A930001M12Rik     0.89
LOC433698         0.89
E430002G05Rik     0.89
TC1460681         0.89
Tox               0.89

Figure 1. Plot of sorted localization scores in the cerebral cortex (left), with the list of the first few
genes with highest localization scores in the cerebral cortex (right).



Figure 2. Heat map of the maximum-intensity projections (sagittal, coronal and axial) of Pak7 (left), the best
marker of the cortex in the sense of localization scores, compared to the characteristic function of the
cerebral cortex (right).

[Figure 3, plot: sorted fitting scores for the cerebral cortex.]

Gene              Fitting score (cerebral cortex)
Satb2             0.89
Kcnh7             0.85
Ephb6             0.84
3110035E14Rik     0.83
Homer1            0.83
Fhl2              0.83
Pak7              0.83
Klf10             0.83
Dusp3             0.83
Cckbr             0.83
1110008P14Rik     0.82
Tbr1              0.82
Igsf9b            0.82
Stx1a             0.82
A230097P14Rik*    0.81
D430041D05Rik     0.81
Mal2              0.81
Igfbp6            0.81
Nik               0.81
Arc               0.81

Figure 3. Plot of sorted fitting scores in the cerebral cortex (left), with the list of the first few genes
with highest fitting scores in the cerebral cortex (right).



Figure 4. Heat map of the maximum-intensity projections (sagittal, coronal and axial) of Satb2 (left), best
marker of the cortex in the sense of fitting scores, compared to the characteristic function of the
cerebral cortex (right).

Figure 5. A coronal section of the ISH of Satb2. Satb2 has the highest fitting score in the
cerebral cortex. The concentration of blue precipitate in the region is manifest.



distributed across the cortex than that of Pak7, which makes Satb2 closer to the characteristic function
of the cerebral cortex.

3.2 Rankings of regions 

For each region $\omega$, we can count the number of markers as the number of genes whose localization score
in the region is larger than the reference value for the region, as defined by the uniform or
average reference. The results are illustrated on figure 7.

Moreover, the two reference values defined above, using either the volumes of the brain regions or the
average of the expression of all genes in the dataset, show important distortions, as can be seen from
the table. In particular, the gene-expression profiles are biased towards the cerebral cortex and the
hippocampal region, as the value of $\lambda^{\mathrm{average}}_{\mathrm{Cerebral\ cortex}}$ is higher than 47 percent, while the value of $\lambda^{\mathrm{unif}}_{\mathrm{Cerebral\ cortex}}$
is lower than 30 percent. This distortion is manifest on the maximal-intensity projection of the sum of
all expression-energy profiles across the dataset, shown on figure 8.

The highest scores across all genes and brain regions are for the cerebellum, both by localization
and by fitting. This comparison across all regions and genes makes more sense for fitting scores than
for localization scores, as the fitting score is not biased by the volume of the region. Both genes make
good sense optically as markers of the cerebellum. The best-localized gene is Gabra6, at 98 percent
localization score (see figure 9). It is the 73rd best-fitted gene to the cerebellum. The best-fitted
gene is 3110001A13Rik, at 89 percent fitting score (see figure 10). It is the 2nd best-localized gene in the

Figure 6. A coronal section of the ISH of Pak7. Pak7 has the highest localization score in the cerebral
cortex. The concentration of blue precipitate in the region is manifest.



Region name                      Nb of genes above average ref.   Nb of genes above volume ref.
Basic cell groups and regions    3041                             3041
Medulla                          1418                             920
Pons                             1315                             600
Cerebellum                       1230                             675
Olfactory areas                  1210                             1241
Thalamus                         1144                             614
Midbrain                         1126                             381
Hippocampal region               1055                             1544
Pallidum                         1038                             276
Hypothalamus                     1007                             456
Retrohippocampal region          998                              1626
Striatum                         855                              479
Cerebral cortex                  782                              1791

Figure 7. Ranking of regions by decreasing number of genes with localization score above average
reference.

Figure 8. Heat map of the maximal-intensity projections of the sum of the expression energies across
all genes in the dataset. The cerebral cortex and the hippocampal region are clearly visible.




Figure 9. Absolute best localization across all genes and regions. Heat map of the 
maximal-intensity projections of the expression energy of Gabra6. 

cerebellum. The distortion is much larger for Gabra6 because its expression profile shows inhomogeneities
inside the cerebellum. For 3110001A13Rik, the localization is optically extremely good.



3.3 Sets of genes 

Sorting the coefficients of the generalized eigenvector associated to the largest generalized eigenvalue for
the cerebral cortex yields a profile (illustrated in figure 11), which has coefficients of both signs (as
above, $\omega$ is chosen to be the cerebral cortex), with a localization score of 0.979 (higher than the optimum
for single genes, as it should be).

It is interesting to note that in some regions the gene-expression profile corresponding to the generalized
eigenvector looks much more coherent as a marker than the best-localized gene. More examples can be
found in figure 22. Taking combinations of genes therefore enables one to get closer to the characteristic

Figure 10. Absolute best fitting across all genes and regions. Heat map of the
maximal-intensity projections of the expression energy of 3110001A13Rik.

Figure 11. Optimal set of marker genes in the sense of generalized localization scores. Sorted
components (left), heat map of the maximal-intensity projections of the expression energies of the genes
weighted by these coefficients (right).

function of the regions. It is therefore tempting to go back to the fitting scores and to adapt them to sets
of genes, with positivity constraints that would produce sparser sets of genes. Midbrain is a brain region
for which the generalized eigenvector gives a much better visual impression of the whole structure than 
the best-localized gene. The generalized fitting score, in the case of midbrain, gets rid of much of the 
signal outside the region but does not achieve much homogeneity inside. Generally speaking, and not too 
surprisingly, the sparse sets of genes are much less impressive markers than the generalized eigenvectors, 
as they rely on much fewer degrees of freedom for optimization. On the other hand, the best individual 
marker returned by the global fitting criterion is much more convincing than the one returned by the 
localization score. 

3.4 Good separators 

For every region in the coarsest annotation of the left hemisphere, we ran an algorithm with a range 
of values of the internal an external paramater for the model function and for the local mask. As the 
boundary of the striatum does not have too much overlap with the boundary of the brain, it is easier to 
visualize than the cerebral cortex in a maximum-intensity projection and we chose it for illustration (see 
figure ([24]) for the first 10 genes returned by the algorithm, none of which ranks better than 218 for 
localization or 70 for fitting). 

Since the maximum-intensity projection can hide some well-separated regions, it is not as reliable 
as sections to evaluate separation properties, but still it is instructive to see how this local criterion 
can return genes that score low for localization and/or fitting but still have a distinct pattern around 
striatum. Slc32a1 has a clear but inhomogeneous pattern in striatum, and a high expression in the main 
olfactory bulb. Both features penalize the global fitting score, but only the second one penalizes the 
global localization score, which is consistent with the fitting rank being much lower than the localization 
rank. Ptpn5 exhibits a less contrasted but more homogeneous pattern in the striatum (it follows 
the caudoputamen rather than the whole striatum), but the expression is also quite high in the cerebral cortex. 
The caudoputamen is still striking and the gene was rescued by the local algorithm, even though the expression 
in the cerebral cortex severely penalizes Ptpn5 both for global localization (according to which it is ranked 

Figure 12. Optimal set of marker genes in the sense of generalized fitting scores. Heat map of the 
maximal-intensity projections of the (normalized) sum of the expression energies of the genes. 

Figure 13. Best separator of striatum. Heat map of the maximal-intensity projection of Slc32a1. 

2056) and global fitting (according to which it is ranked 634). It would be interesting, when repeating 
the experiment for the same gene, either in the same species or in different species, to see if the local 
separation property is as reproducible as the global properties measured by the localization and fitting scores. A sagittal 
section drawn from the ISH data (figure ([15])) confirms that Ptpn5 is highly expressed in the striatum, 
but also in the cortex, albeit to a lesser extent. The separation between cerebral cortex and striatum 
is clearly visible on the section. This is the separation property that our local criterion is supposed to 
detect. 
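
The different ways in which the two global scores react to inhomogeneity and to expression outside the region, as discussed above for Slc32a1, can be checked on toy profiles. The sketch assumes that the localization score is the fraction of squared expression energy contained in the region, and that the fitting score is the cosine between the expression profile and the characteristic function of the region; the function names and the eight-voxel profiles are hypothetical.

```python
import math

def localization_score(energy, region):
    """Assumed definition: squared energy inside the region over total squared energy."""
    inside = sum(energy[v] ** 2 for v in region)
    total = sum(e ** 2 for e in energy)
    return inside / total

def fitting_score(energy, region):
    """Assumed definition: cosine between the profile and the region's indicator."""
    dot = sum(energy[v] for v in region)
    norm_energy = math.sqrt(sum(e ** 2 for e in energy))
    norm_indicator = math.sqrt(len(region))
    return dot / (norm_energy * norm_indicator)

# Toy data: 8 voxels, the region of interest is voxels 0-3.
REGION = [0, 1, 2, 3]
TOY_PROFILES = {
    "homogeneous": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "patchy":      [2.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # inhomogeneous, but inside
    "leaky":       [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0],  # homogeneous, but spills out
}

for name, energy in TOY_PROFILES.items():
    print(name,
          round(localization_score(energy, REGION), 3),
          round(fitting_score(energy, REGION), 3))
```

Under these assumptions the patchy profile keeps a perfect localization score while its fitting score drops, whereas the leaky profile is penalized by both scores.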

3.5 Good co-markers 

A table of the best few co-markers for striatum and cerebellum (see figure ([26])) can be found in an 
appendix. It was obtained at a value τ = 0.5 of the τ-balance constraint defined above. This value is 
somewhat arbitrary and, judging by the tangent coefficients and fitting scores of the genes, the value 
of the tangent is not a monotonic function of the value of the score. The user of the software can 
input a higher value of τ in order to explore genes with a higher tolerance on the relative fittings. We do 
not have a natural optimization criterion to propose for choosing τ and therefore leave it as a parameter. It 
can be noted, however, that the τ-balance constraint at τ = 0.5 for pallidum and cerebellum returns an 
empty set of markers. Thus, fixing the level of the balance constraint and counting the number of marker 
genes returned by the algorithm gives an indication of the degree of solidarity between pairs of regions. 
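
The effect of the tolerance parameter on the number of co-markers returned can be sketched as follows. The definitions used here are stand-ins and not the ones given earlier in the paper: the "tangent" is taken to be the ratio of the mean expression energies in the two regions, the balance constraint is taken to be |tangent - 1| <= tau, and a hypothetical threshold min_fit on the fitting score to the union of the two regions is added so that the returned set can be empty. All names and the toy profiles are assumptions.

```python
import math

def mean_energy(energy, region):
    return sum(energy[v] for v in region) / len(region)

def union_fit(energy, region_a, region_b):
    """Cosine between the profile and the indicator of the union of both regions."""
    union = sorted(set(region_a) | set(region_b))
    dot = sum(energy[v] for v in union)
    norm = math.sqrt(sum(e * e for e in energy))
    return dot / (norm * math.sqrt(len(union)))

def co_markers(genes, region_a, region_b, tau, min_fit=0.5):
    """Names of genes passing the (assumed) balance constraint, best fit first."""
    kept = []
    for name, energy in genes.items():
        mean_a = mean_energy(energy, region_a)
        if mean_a == 0:
            continue  # no expression in the first region: not a co-marker
        tangent = mean_energy(energy, region_b) / mean_a
        fit = union_fit(energy, region_a, region_b)
        if abs(tangent - 1.0) <= tau and fit >= min_fit:
            kept.append((fit, name))
    return [name for fit, name in sorted(kept, reverse=True)]

# Toy data: 8 voxels, two disjoint regions.
REGION_A = [0, 1]
REGION_B = [2, 3, 4]
TOY_GENES = {
    "even":      [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    "tilted":    [1.0, 1.0, 0.8, 0.8, 0.8, 0.0, 0.0, 0.0],
    "one_sided": [1.0, 1.0, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0],
    "elsewhere": [0.1, 0.1, 0.1, 0.1, 0.1, 1.0, 1.0, 1.0],
}

for tau in (0.1, 0.5, 0.9):
    print(tau, co_markers(TOY_GENES, REGION_A, REGION_B, tau))
```

Raising the tolerance admits genes whose expression is increasingly unbalanced between the two regions, and the number of genes returned at a fixed tolerance can be compared across pairs of regions.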

The best two co-markers of the cerebellum and the striatum are Id4 (see figure ([16])) and D330017J20Rik 

Figure 14. Third best separator of striatum. Heat map of the maximal-intensity projection of 
Ptpn5. 

Figure 16. Best co-marker of striatum and cerebellum. Heat map of the maximal-intensity 
projection of Id4. 

Figure 17. Second best co-marker of striatum and cerebellum. Heat map of the 
maximal-intensity projection of D330017J20Rik. 

(see figure ([17])). The first has a tangent very close to one, the second a tangent close to 0.86. 
Id4 indeed has a more homogeneous expression across striatum and cerebellum, whereas 
D330017J20Rik is more highly expressed in the cerebellum, hence a tangent further from 
one, but both genes show a clear pattern for the pair striatum-cerebellum. 

4 Appendix: Reference values of the localization scores for the 
coarsest atlas of the left hemisphere 

Figure 18. Values of the uniform and average reference scores for each of the 12 main regions of the 
left hemisphere. 

5 Appendix: Best-localized genes and characteristic functions 
for the 12 main regions of the left hemisphere 

6 Appendix: Best-fitted genes and characteristic functions for 
the 12 main regions of the left hemisphere 

7 Appendix: Best sets of genes for localization in the twelve 
main regions of the left hemisphere 

Figure 21. Expression profiles of the sets of genes that maximize the localization score for each of the 

8 Appendix: Best sets of genes for fitting in the twelve main 
regions of the left hemisphere 

9 Appendix: Best separators 

Figure 23. Characteristic function of the union of the striatum and the cerebellum. 

10 Appendix: Best co-markers 

Figure 25. Characteristic function of the union of the striatum and the cerebellum. 